tidyverse
Use the p_load function to install and load the
tidyverse:
library(pacman)
p_load(tidyverse)
The data file is 04-Introduction_tidyverse_data.csv. You
can find it on Canvas under Files, then Labs.
Download the file and save it somewhere on your computer.1
The data include drug overdose deaths from the CDC and poverty and unemployment rates from the University of Kentucky Poverty Research Center.
Import the data using read_csv:
state_data <- read_csv("04-Introduction_tidyverse_data.csv")
You can preview the structure of the data with
glimpse:
glimpse(state_data)
## Rows: 969
## Columns: 9
## $ stname <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "Califor…
## $ year <dbl> 1999, 1999, 1999, 1999, 1999, 1999, 1999, 1999, 1999…
## $ deaths <dbl> 169, 46, 511, 113, 2662, 349, 310, 50, 48, 997, 283,…
## $ population <dbl> 4430143, 624781, 5023824, 2651860, 33499204, 4226018…
## $ division <chr> "East South Central", "Pacific", "Mountain", "West S…
## $ region <chr> "South", "West", "West", "South", "West", "West", "N…
## $ stabbr <chr> "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "DC"…
## $ poverty_rate <dbl> 15.2, 7.6, 12.2, 14.7, 14.0, 8.5, 7.2, 10.4, 14.7, 1…
## $ unemployment_rate <dbl> 4.7, 6.5, 4.4, 4.6, 5.2, 3.1, 2.9, 3.4, 6.4, 3.9, 3.…
The glimpse output shows that drug overdose deaths
(deaths) are counts. Use the mutate function
to create a new variable that expresses overdose deaths as rates
(deaths per 100,000 people):
state_data <- state_data %>%
mutate(death_rate = deaths/population*100000)
Review of the %>% operator
Read
state_data %>% mutate(death_rate = deaths/population*100000)
as “take the object state_data and then give it to
mutate().” %>% means “and then.”
Summary statistics can give you a quick sense of the central tendency and spread of the variables in your data. Common summary statistics include the mean, median, min, max, and standard deviation.
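As a quick aside, here is the pipe in isolation (a toy example, not part of the lab data): x %>% f() is equivalent to f(x).

```r
# %>% passes the left-hand side as the first argument of the
# right-hand side function, so these two calls are equivalent.
library(magrittr)  # %>% is also available once the tidyverse is loaded

sqrt(16)       # 4
16 %>% sqrt()  # 4
```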
You can produce summary statistics using the summarize
function:
# start with summary stats for overdose death rates
state_data %>%
summarize(min = min(death_rate),
max = max(death_rate),
median = median(death_rate),
mean = mean(death_rate),
sd = sd(death_rate))
## # A tibble: 1 × 5
## min max median mean sd
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.82 53.6 11.6 12.6 6.69
What’s the minimum overdose death rate? What’s the maximum overdose death rate?
The code above produces summary statistics for a single variable. It
is common practice to generate summary stats for all of the numerical
variables in your dataset. One way you could do this is to copy-paste
the code above and replace death_rate with a different
variable until you have summary stats for each variable. Generally
speaking, this is bad practice. Constantly repeating yourself in your
code can 1) propagate errors and 2) make it difficult to scale up your
code later (e.g., including additional variables).
For repetitive tasks, you can define functions:
summ_fun <- function(x){
state_data %>%
summarize(min = min(x),
max = max(x),
median = median(x),
mean = mean(x),
sd = sd(x))
}
Functions require inputs. The x is the input to the
function defined above. It serves as a placeholder for a variable name
from state_data.
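To see how the placeholder works on its own, here is a smaller toy function (range_fun is an illustrative name, not part of the lab):

```r
# x is a placeholder: whatever vector you pass in is used
# wherever x appears inside the function body
range_fun <- function(x){
  max(x) - min(x)
}

range_fun(c(2, 5, 9))  # 7
```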
You can use summ_fun to obtain summary statistics for
each variable:
summ_fun(state_data$death_rate)
## # A tibble: 1 × 5
## min max median mean sd
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.82 53.6 11.6 12.6 6.69
summ_fun(state_data$poverty_rate)
## # A tibble: 1 × 5
## min max median mean sd
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4.5 23.1 12.1 12.6 3.38
summ_fun(state_data$unemployment_rate)
## # A tibble: 1 × 5
## min max median mean sd
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2.3 13.7 5.2 5.64 1.97
summ_fun(state_data$population)
## # A tibble: 1 × 5
## min max median mean sd
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 491780 39536653 4146101 5946471. 6667352.
summ_fun(state_data$year)
## # A tibble: 1 × 5
## min max median mean sd
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1999 2017 2008 2008 5.48
Putting state_data$ before the variable name lets R know
where to look for the variable. For example,
state_data$death_rate says “hey R, look for
death_rate inside the object
state_data.”
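A tiny self-contained example of the $ operator (the data frame here is made up for illustration):

```r
# $ extracts a single column from a data frame as a vector
df <- data.frame(city = c("A", "B"), pop = c(100, 250))
df$pop        # 100 250
mean(df$pop)  # 175
```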
We are going to use the ggplot() function, aka
the grammar of graphics, included in the ggplot2
package. At this point you don’t need to load the ggplot2
package explicitly, as it’s already loaded when you load
the tidyverse package.
?ggplot
# ggplot(data = NULL, mapping = aes(), ..., environment = parent.frame())
# data argument: Default dataset to use for plot
# mapping argument: Default list of aesthetic mappings to use for plot
ggplot(data = state_data)
# If you want to specify the x axis and y axis, you could do so by
ggplot(data = state_data, mapping=aes(x = year, y = death_rate))
# Nothing appears! You need to add geoms -- visual marks that represent data points
# Here's one representing the data with points
# Note that you want to layer up the plot by using `+` not `%>%`.
ggplot(data = state_data, mapping=aes(x = year, y = death_rate)) +
geom_point()
# Here's one representing the data with a line
ggplot(data = state_data, mapping=aes(x = year, y = death_rate)) +
geom_line()
# You can layer multiple geoms in one plot by connecting them with `+`.
# The following code layers `geom_point` and `geom_line` on the same plot.
ggplot(data = state_data, mapping=aes(x = year, y = death_rate)) +
geom_point() + geom_line()
# What if you want to label the x-axis and y-axis and also add a title to the plot?
# You add another layer, `labs`, which is short for `labels`
ggplot(data = state_data, mapping=aes(x = year, y = death_rate)) +
geom_point() + geom_line() +
labs(title = "Can you spot the opioid crisis?", x = "Year", y = "Drug Overdose Death Rate (per 100,000)")
# We can use the pipe operator here by pulling out the first argument of the ggplot function,
# placing it in front of `ggplot()`, then connecting it with `%>%`.
state_data %>% ggplot(data = ., mapping=aes(x = year, y = death_rate)) +
geom_point() + geom_line() +
labs(title = "Can you spot the opioid crisis?", x = "Year", y = "Drug Overdose Death Rate (per 100,000)")
# If you want to color differently by region, specify the color argument inside `aes()`.
# Then add the grouping variable
state_data %>% ggplot(data = ., mapping=aes(x = year, y = death_rate, color=region)) +
geom_point() + geom_line() +
labs(title = "Can you spot the opioid crisis?", x = "Year", y = "Drug Overdose Death Rate (per 100,000)")
# Something didn't quite work though.
# It looks like `ggplot` connected all the dots
# when you wanted it to connect the dots by state.
# You can fix this with the `color` argument in `aes`.
# What if we group the data by state instead of by region?
state_data %>% ggplot(data = ., mapping=aes(x = year, y = death_rate, color=stname)) +
geom_point() + geom_line() +
labs(title = "Can you spot the opioid crisis?", x = "Year", y = "Drug Overdose Death Rate (per 100,000)")
# We can easily create different plots for each region using `facet_wrap()`.
# The tilde '~' can be read as "by".
# This means that we facet *by* region.
# Notice that we changed `color=stname` to `group=stname`.
# This just groups the data by `stname` but does not color differently by each state.
state_data %>% ggplot(data = ., mapping=aes(x = year, y = death_rate, group=stname)) +
geom_point() + geom_line() +
labs(title = "Can you spot the opioid crisis?", x = "Year", y = "Drug Overdose Death Rate (per 100,000)") +
facet_wrap(~region)
While the plot looks like spaghetti, you can see that drug overdose death rates increase over time. There is an especially pronounced increase in a handful of states starting around 2010. What do you think is driving these increases?
To get a better sense of where these changes are happening, you can
facet your plot by census division using facet_wrap():
state_data %>%
ggplot(aes(x = year, y = death_rate, group = stname)) +
geom_point()+
geom_line() +
facet_wrap(~division)+
labs(title = "Can you spot the opioid crisis?", x = "Year", y = "Drug Overdose Death Rate (per 100,000)")
division refers to census
divisions.
You can change the appearance of your plot by specifying a theme:
state_data %>%
ggplot(aes(x = year, y = death_rate, group = stname)) +
geom_line() +
facet_wrap(~division) +
labs(title = "Can you spot the opioid crisis?", x = "Year", y = "Drug Overdose Death Rate (per 100,000)") +
theme_bw()
What about national trends? To examine national overdose death rates
over time, you can aggregate the state data using group_by
and summarize. Notice that because we are aggregating up
the death rate, we take a weighted average,
where each state's weight is its population, to
calculate the national death rate:
state_data %>%
group_by(year) %>%
summarize(death_rate = weighted.mean(x = death_rate, w = population)) %>%
ggplot(aes(x = year, y = death_rate)) +
geom_line() +
labs(title = "Can you spot the opioid crisis?", x = "Year", y = "Drug Overdose Death Rate (per 100,000)") +
theme_bw()
Do overdoses correlate with economic conditions?
state_data %>%
ggplot(aes(x = poverty_rate, y = death_rate)) +
geom_point() +
labs(title = "Deaths of despair?", x = "Poverty Rate (%)", y = "Drug Overdose Death Rate (per 100,000)") +
theme_bw()
state_data %>%
ggplot(aes(x = unemployment_rate, y = death_rate)) +
geom_point() +
labs(title = "Deaths of despair?", x = "Unemployment Rate (%)", y = "Drug Overdose Death Rate (per 100,000)") +
theme_bw()
You can calculate correlation coefficients using
cor:
cor(state_data$death_rate, state_data$poverty_rate)
## [1] 0.2167742
cor(state_data$death_rate, state_data$unemployment_rate)
## [1] 0.2065104
Are the correlations between economic conditions and overdose deaths positive or negative? How strong are the correlations? Are they relatively strong? Relatively weak?
There are some pitfalls to examining bivariate relationships with
data from several different years. You can learn more about these in an
Econometrics class or a statistics class. For now, you can sidestep some
of these issues by focusing your attention on the most recent year in
the sample period. Using the filter function, you can
restrict the sample to observations from 2017:
state_data_2017 <- state_data %>%
filter(year == 2017)
Note the use of == instead of =.
== is for logical comparisons, while = is for
assignment.2
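A quick toy illustration of the difference (the values are arbitrary):

```r
# <- (or =) assigns a value; == tests equality and returns TRUE/FALSE
y <- 2017
y == 2017  # TRUE
y == 1999  # FALSE
```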
state_data_2017 %>%
ggplot(aes(x = poverty_rate, y = death_rate)) +
geom_point() +
labs(title = "Deaths of despair?", x = "Poverty Rate (%)", y = "Drug Overdose Death Rate (per 100,000)", caption = "Note: 2017 data only.") +
theme_bw()
state_data_2017 %>%
ggplot(aes(x = unemployment_rate, y = death_rate)) +
geom_point() +
labs(title = "Deaths of despair?", x = "Unemployment Rate (%)", y = "Drug Overdose Death Rate (per 100,000)", caption = "Note: 2017 data only.") +
theme_bw()
cor(state_data_2017$death_rate, state_data_2017$poverty_rate)
## [1] 0.0552481
cor(state_data_2017$death_rate, state_data_2017$unemployment_rate)
## [1] 0.3593693
Do the correlations change when you restrict the sample to 2017?
In case you were wondering, the deaths of despair explanation for the opioid crisis doesn’t really hold up to closer scrutiny.
lm()
Now let’s look at how to add a fitted regression line that minimizes the sum of squared residuals to a plot.
# Add another layer `geom_smooth(method="lm", se=FALSE)`
state_data_2017 %>%
ggplot(aes(x = poverty_rate, y = death_rate)) +
geom_point() +
labs(title = "Deaths of despair?", x = "Poverty Rate (%)", y = "Drug Overdose Death Rate (per 100,000)", caption = "Note: 2017 data only.") +
theme_bw() +
geom_smooth(method="lm", se=FALSE)
Now, using the formulas below, we will reproduce the intercept and slope
coefficient estimates from the OLS regression.
Slope coefficient
\[\hat{\beta}_1 = \dfrac{\sum_{i=1}^n (Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^n (X_i - \bar{X})^2}\]
Intercept
\[ \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \]
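As a sketch, you can compute these formulas by hand on toy vectors (x and y below are made up for illustration) and confirm the result matches lm():

```r
# Hand-computed OLS estimates, following the formulas above
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

b1 <- sum((y - mean(y)) * (x - mean(x))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept

# The same estimates from lm()
fit <- lm(y ~ x)
coef(fit)  # should equal c(b0, b1)
```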
# You can use `lm()` to perform ordinary least squares regression analysis.
?lm()
# "A typical model has the form response ~ terms
# where response is the (numeric) response vector
# and terms is a series of terms which specifies a linear predictor for response."
# In other words, the dependent variable comes first.
# Then we use `~` to separate it from the independent variables.
# If we were to run a regression when the outcome variable is death_rate
# and the independent variable is poverty_rate, then it looks as follows:
lm(death_rate ~ poverty_rate, data=state_data_2017) %>% summary()
##
## Call:
## lm(formula = death_rate ~ poverty_rate, data = state_data_2017)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.1801 -8.3651 -0.8157 6.5827 30.1764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.0379 6.3711 3.145 0.00282 **
## poverty_rate 0.1979 0.5110 0.387 0.70019
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.52 on 49 degrees of freedom
## Multiple R-squared: 0.003052, Adjusted R-squared: -0.01729
## F-statistic: 0.15 on 1 and 49 DF, p-value: 0.7002
# Intercept estimate is reported to be 20.0379 using lm function
# Slope estimate is reported to be 0.1979 using lm function
This linear regression model examines the relationship between death_rate (the dependent variable) and poverty_rate (the independent variable) using the state_data_2017 dataset.
Intercept (Constant Term): The intercept is 20.0379. This means that when poverty_rate is zero, the expected death_rate is approximately 20.0379. However, a zero poverty rate might not be realistic, so this interpretation should be made with caution. Slope: The coefficient for poverty_rate is 0.1979. This suggests that for every one-unit increase in the poverty rate, the death rate is expected to increase by 0.1979 units, all else being equal. However, this relationship is not statistically significant (check the stars after the number; if there are no stars, it is not significant).
The p-value for poverty_rate is 0.70019, which is much higher than the common significance levels (0.05 or 0.01). This means that the relationship between poverty rate and death rate in this model is not statistically significant.
The Multiple R-squared value is 0.003052, indicating that only about 0.3052% of the variability in the death_rate is explained by the poverty_rate. This is a very low value, suggesting that the model does not explain much of the variation in the death rate.
The Adjusted R-squared value is -0.01729, which further confirms the limited explanatory power of this model, considering the number of predictors and the sample size.
In summary, while the model suggests a slight increase in the death rate with an increase in the poverty rate, this relationship is not statistically significant based on the data from 2017. It indicates that poverty_rate is not a strong predictor of death_rate in this specific model and dataset. Additional variables or a different modeling approach might be needed to effectively predict or understand the factors influencing the death rate.
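Rather than reading numbers off the printed summary, you can extract them programmatically. Below is a sketch using the built-in mtcars dataset so it runs standalone; the same accessor calls work on lm(death_rate ~ poverty_rate, data = state_data_2017):

```r
# Extract pieces of a fitted model with accessor functions
fit <- lm(mpg ~ wt, data = mtcars)  # mtcars ships with base R

coef(fit)                  # intercept and slope estimates
summary(fit)$r.squared     # Multiple R-squared
summary(fit)$coefficients  # estimates, std. errors, t values, p-values
```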
Unsolicited life advice: You should
make an ECON607 folder where you can store your work in one
place. Also, you should pick file and folder names without spaces. Why?
R often chokes on spaces in file names when loading data. In place of
spaces, you can use underscores (_) or dashes
(-).↩︎
Some people use = as an alternative to the
assignment operator <-. If you want, you can too. Just
be consistent.↩︎